Table of Contents
1 Archives and DataSets
2 Archive Format
2.1 Basic Usage
2.2 Large Arrays
2.3 Importable Archives
3 Archive Details
3.1 Single-item Archives
3.2 Containers, Duplicates, and Circular References
4 Archive Examples
4.1 Non-scoped (flat) Format
4.2 Scoped Format
5 DataSet Format
6 DataSet Examples
Archives and DataSets¶
There are two main classes provided by the persist module: persist.Archive and persist.DataSet.
Archives deal with the linkage between objects so that if multiple objects are referred to, they are only stored once in the archive. Archives provide two main methods to serialize the data:
- Via the str() operator, which returns a string that can be executed to restore the archive.
- Via the Archive.save() method, which exports the archive to an importable python package or module.
DataSets use archives to provide storage for multiple sets of data along with associated metadata. Each set of data is designed to be accessed concurrently using locks.
Archive Format¶
The persist.Archive object maintains a collection of python objects that are inserted with persist.Archive.insert(). This can be serialized to a string and then reconstituted through evaluation:
Basic Usage¶
[1]:
from persist.archive import Archive
a = 1
x = range(2)
y = range(3) # Implicitly referenced in archive
b = [x, y, y] # Nested references to x and y
# scoped=False is prettier, but slower and not as safe
archive = Archive(scoped=False)
archive.insert(a=a, x=x, b=b)
# Get the string representation
s = str(archive)
print(s)
from builtins import range as _range
_g3 = _range(0, 3)
x = _range(0, 2)
b = [x, _g3, _g3]
a = 1
del _range
del _g3
try: del __builtins__, _arrays
except NameError: pass
[2]:
d = {}
exec(s, d)
print(d)
assert d['a'] == a
assert d['x'] == x
assert d['b'] == b
assert d['b'][1] is d['b'][2] # Note: these are the same object
{'x': range(0, 2), 'b': [range(0, 2), range(0, 3), range(0, 3)], 'a': 1}
Large Arrays¶
If you have large arrays of data, it is better to store them externally. To do this, set array_threshold to specify the maximum number of elements to store inline. Any larger array will be stored in Archive.data and not included in the string representation. To properly reconstitute the archive, this data must be provided in the environment as a dictionary under the key Archive.data_name, which defaults to _arrays:
[3]:
import os.path, tempfile, shutil, numpy as np
from persist.archive import Archive
a = 1
x = np.arange(10)
y = np.arange(20) # Implicitly referenced in archive
b = [x, y]
archive = Archive(scoped=False, array_threshold=5)
archive.insert(a=a, x=x, b=b)
# Get the string representation
s = str(archive)
print(s)
print(archive.data)
x = _arrays['array_0']
b = [x, _arrays['array_1']]
a = 1
try: del __builtins__, _arrays
except NameError: pass
{'array_0': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), 'array_1': array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19])}
To evaluate the string representation, we need to provide the _arrays dictionary:
[4]:
d = dict(_arrays=archive.data)
exec(s, d)
print(d)
{'x': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), 'b': [array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19])], 'a': 1}
To store the data, use Archive.save_data():
[5]:
import os.path, tempfile
tmpdir = tempfile.mkdtemp() # Make temporary directory for data
datafile = os.path.join(tmpdir, 'arrays')
archive.save_data(datafile=datafile)
print(tmpdir)
!ls $tmpdir/arrays
!rm -rf $tmpdir
/var/folders/m7/dnr91tjs4gn58_t3k8zp_g000000gp/T/tmp5jutcwlt
array_0.npy array_1.npy
Importable Archives¶
(New in version 1.0)
Archives can be saved as importable packages using the save() method. This writes the representable portion of the archive as an importable module, with additional code to load any external arrays. Archives can be saved either as a full package (a directory with a <name>/__init__.py file etc.) or as a single <name>.py module. These can be imported without the persist package:
[6]:
import os.path, sys, tempfile, shutil, numpy as np
from persist.archive import Archive
tmpdir = tempfile.mkdtemp()
a = 1
x = np.arange(10)
y = np.arange(20) # Implicitly referenced in archive
b = [x, y]
archive = Archive(array_threshold=5)
archive.insert(a=a, x=x, b=b)
archive.save(dirname=tmpdir, name='mod1', package=True)
archive.save(dirname=tmpdir, name='mod2', package=False)
!tree $tmpdir
sys.path.append(tmpdir)
import mod1, mod2
sys.path.pop()
for mod in [mod1, mod2]:
assert mod.a == a and np.allclose(mod.x, x)
!rm -rf $tmpdir
/var/folders/m7/dnr91tjs4gn58_t3k8zp_g000000gp/T/tmp9ahg71yq
|-- mod1
| |-- __init__.py
| `-- _arrays
| |-- array_0.npy
| `-- array_1.npy
|-- mod2.py
`-- mod2_arrays
|-- array_0.npy
`-- array_1.npy
3 directories, 6 files
Archive Details¶
Single-item Archives¶
(New in version 1.0)
If an archive contains a single item, then the representation can be simplified so that importing the module results in the actual object. This is mainly for use in DataSets, where it allows us to have large objects in a module that only get loaded if explicitly imported. In this case, one can also omit the name when calling Archive.save(), as it defaults to the name of the single item.
[7]:
import os.path, sys, tempfile, shutil, numpy as np
from persist.archive import Archive
tmpdir = tempfile.mkdtemp()
x = np.arange(10)
y = np.arange(20) # Implicitly referenced in archive
b = [x, y]
archive = Archive(single_item_mode=True, array_threshold=5)
archive.insert(b1=b)
archive.save(dirname=tmpdir, package=True)
archive = Archive(scoped=False, single_item_mode=True, array_threshold=5)
archive.insert(b2=b)
archive.save(dirname=tmpdir, package=False)
!tree $tmpdir
sys.path.append(tmpdir)
import b1, b2
sys.path.pop()
for b_ in [b1, b2]:
assert np.allclose(b_[0], x) and np.allclose(b_[1], y)
!rm -rf $tmpdir
/var/folders/m7/dnr91tjs4gn58_t3k8zp_g000000gp/T/tmpval74y_i
|-- b1
| |-- __init__.py
| `-- _arrays
| |-- array_0.npy
| `-- array_1.npy
|-- b2.py
`-- b2_arrays
|-- array_0.npy
`-- array_1.npy
3 directories, 6 files
Note what is happening here: although we explicitly import b1, the result is that b1 = [x, y] is the actual list rather than a module. This behaviour is somewhat of an abuse of the import system, so you should not rely on it too heavily. The use in DataSet is that these modules are included as submodules of the DataSet package, acting as attributes of the top-level package, but only being loaded when explicitly imported, to limit memory usage etc.
Containers, Duplicates, and Circular References¶
The main complexity with archives comes from objects like lists and dictionaries that refer to other objects: all objects referenced by such “containers” need to be stored only once in the archive. A current limitation is that circular dependencies cannot be resolved. The pickling mechanism provides a way to restore circular dependencies, but I do not see an easy way to do this in a human-readable format, so the current requirement is that the references in an object form a directed acyclic graph (DAG).
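To make the DAG requirement concrete, here is a small sketch, independent of the persist package: has_cycle is a hypothetical helper (not part of the library) that runs a depth-first search over container references and flags any object reachable from itself.

```python
# Sketch of the DAG requirement (not persist's actual code): a
# depth-first search over container references that detects cycles,
# i.e. objects reachable from themselves.

def has_cycle(obj, _stack=None):
    """Return True if the reference graph rooted at obj contains a cycle."""
    if _stack is None:
        _stack = set()
    if id(obj) in _stack:
        return True                      # obj is reachable from itself
    if not isinstance(obj, (list, tuple, dict)):
        return False                     # leaves cannot form cycles
    _stack.add(id(obj))
    children = obj.values() if isinstance(obj, dict) else obj
    result = any(has_cycle(child, _stack) for child in children)
    _stack.discard(id(obj))
    return result

x = [1, 2]
b = [x, x]       # duplicate references are fine -- still a DAG
assert not has_cycle(b)

c = [1]
c.append(c)      # circular reference -- cannot be archived
assert has_cycle(c)
```

Duplicate references, as with b above, are exactly what the archive deduplicates; only a true cycle violates the requirement.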
Archive Examples¶
Here we demonstrate a simple archive containing all of the data. We start with the simplest format, obtained with scoped=False:
Non-scoped (flat) Format¶
We start with the scoped=False format. This produces a flat archive that is easier to read:
[8]:
import os.path, tempfile, shutil
from persist.archive import Archive
a = 1
x = range(2)
y = range(3) # Implicitly referenced in archive
b = [x, y, y] # Nested references to x and y
archive = Archive(scoped=False)
archive.insert(a=a, x=x, b=b)
# Get the string representation
%time s = str(archive)
print(s)
CPU times: user 778 µs, sys: 210 µs, total: 988 µs
Wall time: 927 µs
from builtins import range as _range
_g3 = _range(0, 3)
x = _range(0, 2)
b = [x, _g3, _g3]
a = 1
del _range
del _g3
try: del __builtins__, _arrays
except NameError: pass
Note that intermediate objects not explicitly inserted are stored in variables like _g# and that these are deleted, so that evaluating the string in a dictionary gives a clean result:
[9]:
# Now execute the representation to get the data
d = {}
exec(s, d)
print(d)
d['b'][1] is d['b'][2]
{'x': range(0, 2), 'b': [range(0, 2), range(0, 3), range(0, 3)], 'a': 1}
[9]:
True
The potential problem with the flat format is that, to obtain this simple representation, a graph reduction is performed that replaces intermediate nodes, ensuring that local variables do not have name clashes as well as simplifying the representation. Replacing variables in representations can have performance implications if the objects are large. The fastest approach is a string replacement, but this can make mistakes if the substring appears in data. The option robust_replace invokes the python AST parser instead, but this is slower.
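The failure mode of plain string replacement is easy to demonstrate with a contrived sketch (this is not persist's actual implementation; replace_name is a hypothetical helper built on the standard tokenize module): if a temporary name happens to appear inside a string literal, a naive str.replace corrupts the data, while a token-aware replacement does not.

```python
import io
import tokenize

def replace_name(source, old, new):
    """Replace NAME tokens equal to `old` in a single line of python source.

    Unlike str.replace, this never touches the contents of string literals.
    (Hypothetical helper for illustration; handles one line only.)
    """
    out, col = [], 0
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        typ, string, (_, start), (_, end), _ = tok
        out.append(source[col:start])
        out.append(new if typ == tokenize.NAME and string == old else string)
        col = end
    return "".join(out)

line = "b = [_g1, '_g1 is my favourite name']"

# Naive string replacement also substitutes inside the string literal:
naive = line.replace('_g1', 'range(0, 3)')
assert naive == "b = [range(0, 3), 'range(0, 3) is my favourite name']"

# Token-aware replacement leaves the data intact:
robust = replace_name(line, '_g1', 'range(0, 3)')
assert robust == "b = [range(0, 3), '_g1 is my favourite name']"
```

This is the trade-off mentioned above: the naive form is a single fast pass over the text, while anything token- or AST-aware must actually parse the source.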
Scoped Format¶
To alleviate these issues, the scoped=True format is provided. This is visually much more complicated, as each object is constructed in a function. The advantage is that this provides a local scope in which objects are defined. As a result, any local variables defined in the representation of an object can be used as they are, without worrying that they will conflict with other names in the file. No reduction is performed and no replacements are made, making the method faster and more robust, but less attractive if the files need to be inspected by humans:
[10]:
archive = Archive(scoped=True)
archive.insert(a=a, x=x, b=b)
# Get the string representation
%time s = str(archive)
print(s)
CPU times: user 412 µs, sys: 22 µs, total: 434 µs
Wall time: 452 µs
def _g3():
from builtins import range
return range(0, 3)
_g3 = _g3()
def x():
from builtins import range
return range(0, 2)
x = x()
def b(_l_0=x,_l_1=_g3,_l_2=_g3):
return [_l_0, _l_1, _l_2]
b = b()
a = 1
del _g3
try: del __builtins__, _arrays
except NameError: pass
DataSet Format¶
This is the new format of DataSets starting with revision 1.0.
A DataSet is a directory with the following files:

- _this_dir_is_a_DataSet: This is an empty file signifying that the directory is a DataSet.
- __init__.py: Each DataSet is an importable python module so that the data can be used on a machine without the persist package. This file contains the following variable:
  - _info_dict: This is a dictionary/namespace with string keys (which must be valid python identifiers) and associated data (which should in general be small). These are intended to be interpreted as meta-data.

For the remainder of this discussion, we shall assume that _info_dict contains the key 'x'.

- x.py: This is the python file responsible for loading the data associated with the key 'x' in _info_dict. If the size of the array is less than the array_threshold specified in the DataSet object, then the data for the arrays are stored in this file; otherwise this file is responsible for loading the data from an associated file.
- x_data.*: If the size of the array stored in x is larger than the array_threshold, then the data associated with x is stored in this file/directory, which may be an HDF5 file or a numpy array file.
These DataSet modules can be directly imported. Importing the top-level DataSet results in a module with the _info_dict attribute containing all the meta-data. The data items become available when you explicitly import them.
DataSet Examples¶
[11]:
import os.path, pprint, sys, tempfile, shutil, numpy as np
from persist.archive import DataSet
tmpdir = tempfile.mkdtemp() # Make temporary directory for dataset
print("Storing dataset in {}".format(tmpdir))
a = np.arange(10)
x = np.arange(15)
ds = DataSet('dataset', 'w', path=tmpdir, array_threshold=12, data_format='npy')
ds.a = a
ds.x = [a, x]
ds['a'] = "A small array"
ds['x'] = "A list with a small and large array"
!tree $tmpdir
del ds
ds = DataSet('dataset', 'r', path=tmpdir)
print(ds['a'])
print(ds['x'])
print(ds.a) # The arrays a and x are not actually loaded until here
print(ds.x)
Storing dataset in /var/folders/m7/dnr91tjs4gn58_t3k8zp_g000000gp/T/tmplw345fr6
/var/folders/m7/dnr91tjs4gn58_t3k8zp_g000000gp/T/tmplw345fr6
`-- dataset
|-- __init__.py
|-- _this_dir_is_a_DataSet
|-- a.py
|-- x.py
`-- x_data
`-- array_0.npy
2 directories, 5 files
A small array
A list with a small and large array
[0 1 2 3 4 5 6 7 8 9]
[array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])]
As an alternative, you can directly import the dataset without needing the persist library. This also has the feature of delayed loading:
[12]:
sys.path.append(tmpdir)
import dataset # Only imports _info_dict at this point
print()
print('import dataset: The dataset module initially contains')
print(dir(dataset))
import dataset.a, dataset.x # Now we get a and x
print()
print('import dataset.a, dataset.x: The dataset module now contains')
print(dir(dataset))
sys.path.pop()
shutil.rmtree(tmpdir) # Remove files
import dataset: The dataset module initially contains
['__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '_info_dict']
import dataset.a, dataset.x: The dataset module now contains
['__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '_info_dict', 'a', 'x']